ABSTRACT


The application of neural networks to the detection of variance
transitions in Gaussian noise is analyzed. The problem is a benchmark
example of hypothesis testing on a nonstationary stochastic
process. Comparisons among perceptron, back propagation nets,
and classical (Neyman-Pearson) decisioning are provided by Monte
Carlo simulation. False alarm and detection probabilities for classical
and neural net algorithms are computed and compared.
NEURAL NETWORK PERFORMANCE
ON  THE  STOCHASTIC  EXCLUSIVE-OR  PROBLEM 


I.  INTRODUCTION 


In this report the application of neural networks to the detection of variance transitions in
Gaussian noise is considered. The problem, which consists of transition detection between a pair
of input sample variances, is a benchmark example of hypothesis testing on a nonstationary
stochastic process. In the case of neural net algorithms the testing of hypotheses results from the
decision space output of the last layer of neurons.

Variance transition detection in a Gaussian random process is probably the most tractable
example upon which to study machine analysis of a nonstationary stochastic process. It is an
obvious test bed for the analysis of neural network decisioning of stochastic data. In this report
the modeling of the process as Gaussian is dictated by the desire for a comparison of neural
network and classical detection techniques. For specific applications, other data-derived parameters,
such as correlation length, may be more appropriate for the analysis of transitions. It is expected,
however, that neural network structures required for hypothesis testing are independent of the
nature of the sufficient statistics and depend instead upon the pattern of mean values. The
purpose of this research is to provide a theoretical foundation for work currently being done on
satellite maneuver detection. The detection of maneuvers is performed by neural networks which
have been trained upon signal variances, spectral widths, and autoregressive model coefficients
from radar cross section data. This research is also relevant to the data fusion effort, in which
these parameters are obtained from different sensors.

The definition of a neural net is summarized in Figure 1. It consists of a series of parallel
input lines connected to a set of layered neurons.¹ Each neuron performs a prescribed (generally
nonlinear) operation on the set of values applied simultaneously to the input, and outputs a value
to the right along a connection. The only function of the connections is multiplicative weighting.
Overall net output from the last layer of neurons consists of elements in a binary decision space
reflecting various hypotheses about the data. Knowledge about the mapping between data and
decision spaces is stored in the connection weights $\{W_{ij}\}$ shown in Figure 1. These values are
adjusted by a learning algorithm which uses data for which correct hypotheses have been
identified. It is exactly the distributed storage of information (connection weights) which suggests that
powers of association and robustness of human intelligence may be realized in neural net
algorithms.

The fundamental problems of neural net design involve net structure and learning
algorithms. A particular mapping can be realized by infinitely many net structures. Although work
has been done to identify properties of net functioning which depend upon structure, particularly
in the area of learning time² and memory capacity,³⁻⁴ the overall principles of structure which
govern function have yet to be discovered. An issue which arises in hypothesis testing, and
precedes the problem of neural net structure, is the comparison of neural net and classical test
performance. While it is probable that neural net analysis succeeds in areas of decisioning in which
classical techniques fail, the quantitative comparison of tests for problems amenable to both
techniques deepens an understanding of the possibilities of neural hypothesis testing.


Figure 1. Generic neural net: connection weights $\{W_{ij}\}$ for neurons i and j; thresholds $\{\theta_i\}$ for each neuron i.

In this report a Gaussian random process is considered in which a transition of the variance
is possible. Sample variances before and after the supposed transition form a pair of
$\chi^2$-distributed sufficient statistics which are input to the net. The input of sufficient statistics in fact
embodies the application of prior information about the process to aid net decisioning. In principle,
entire data sequences could be input to the net, which would then be forced to discover features
of the data (such as sufficient statistics) upon which to base decisions. Net operation upon
sufficient statistics is a simple relevant example of modular neural net design; that is, prior rule
structure (compute sufficient statistics) grafted upon parallel neural decisioning (test hypotheses).⁵

Section II of this report consists of a review of classical hypothesis testing for the case of
transition detection in a Gaussian random process. False alarm and detection probabilities are
expressed in terms of the threshold parameter used in the hypothesis test. The optimum
threshold, corresponding to maximum detection and minimum false alarm probabilities, is computed
for a number of different noise and sampling conditions. These probabilities form the basis for
the comparison of neural network and classical decisioning.

False alarm and detection probabilities for neural networks are ensemble-derived from
Monte Carlo simulation. Starting with a random set of connection weights the networks
were trained on a stochastic data set by application of known learning algorithms (either
perceptron convergence or back propagation). False alarm and detection probabilities were computed
directly from the proportion of successful mappings obtained on a performance ensemble of
stochastic data. Sections III and IV contain the results of the Monte Carlo analysis of perceptron⁶⁻⁷
and back propagation networks,² respectively. Transition detection for various noise conditions
and sampling windows is studied as a function of the applied number of iterations of the learning
algorithms. Section V contains quantitative comparisons between classical (Neyman-Pearson) and
neural network performance. This is obtained from the asymptotic (large iteration number) values
of $P_f$ and $P_d$ for neural networks, and the corresponding classical probabilities at the optimum
threshold.


II.  VARIANCE  TRANSITIONS  IN  GAUSSIAN  NOISE 


As mentioned in the introduction, an issue which precedes net design for hypothesis testing
is the quantitative comparison of neural net and classical decisioning. This is possible in a
Gaussian process in which an identified sufficient statistic such as variance exists. In this section the
Neyman-Pearson test for variance transitions is derived, and relevant conditional probabilities are
numerically computed.⁸

The classical approach to transition detection for a Gaussian process requires the definition
of the statistic,⁸

$$x = \sum_{i=1}^{N} y_i^2 , \tag{1}$$

where $\{y_i \mid i = 1, 2, \ldots, N\}$ is the zero mean stochastic data in a window of length $N$. The variable
$x$ is $\chi^2$-distributed with a probability density

$$p(x) = \frac{x^{(N/2-1)} \exp[-x/2\sigma^2]}{2^{N/2}\, \sigma^{N}\, \Gamma(N/2)} , \tag{2}$$

where $\sigma$ is the standard deviation of the Gaussian random variable $y$. In the case of two possible
deviations, $\sigma_0$ and $\sigma_1$, classical hypothesis testing results from the definition of a threshold $\gamma$
which determines the decision: that is, $x$ greater (less) than $\gamma$ implies noise deviation $\sigma_1$ ($\sigma_0$).
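The statistic, its density, and the threshold decision described above can be sketched in a few lines. This is only an illustrative sketch: the function names, the seed, and the example threshold of 20 are assumptions introduced here, not values from the report, and the density is evaluated in logarithms purely for numerical convenience.

```python
import math
import random


def window_statistic(samples):
    """Eq. (1): sum of squared zero-mean samples in one window."""
    return sum(y * y for y in samples)


def chi2_density(x, n, sigma):
    """Eq. (2): density of x for window length n and deviation sigma."""
    if x <= 0.0:
        return 0.0  # x <= 0 is treated as outside the support
    half_n = n / 2.0
    # Evaluate the logarithm of Eq. (2) so large n or x does not overflow.
    log_p = ((half_n - 1.0) * math.log(x) - x / (2.0 * sigma ** 2)
             - half_n * math.log(2.0) - n * math.log(sigma)
             - math.lgamma(half_n))
    return math.exp(log_p)


def decide(x, gamma):
    """Threshold test: 1 (high deviation) if x exceeds gamma, else 0."""
    return 1 if x > gamma else 0


rng = random.Random(0)  # fixed seed, chosen only for repeatability
window = [rng.gauss(0.0, 2.0) for _ in range(10)]
x = window_statistic(window)
print(x, chi2_density(x, 10, 2.0), decide(x, 20.0))
```

For a two-sample window with unit deviation, Equation (2) reduces to $\exp(-x/2)/2$, which gives a quick check on the density routine.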

The computation of false alarm and detection probabilities for transition detection requires
the definition of conditional probabilities $\{p[(i,j) \mid (l,m)] \mid i,j,l,m \in (0,1)\}$. Each pair $(i,j)$
corresponds to a (before, after) variance condition, where an index $i$ of 1 or 0 denotes a high or low
variance, respectively. The conditional probability $p[(i,j) \mid (l,m)]$ represents the detection of noise
condition $(i,j)$ when the windows truly correspond to deviation pairs $(l,m)$. Assuming independent
tests of each data window, the conditional probability factorizes

$$p[(i,j) \mid (l,m)] = p(i \mid l)\, p(j \mid m) , \tag{3}$$

where $p(i \mid j)$ denotes the probability of choosing deviation $i$ with a true noise deviation $j$. The pair
of decisions required to analyze a transition is based upon the values of $x$ in Equation (1) for
two data windows and the threshold $\gamma$ (as described above).

Detection and false alarm probabilities for transition detection can be expressed in terms of
single segment hypothesis test probabilities by application of Equation (3). It is shown in the
Appendix that the transition hypothesis test probabilities are given by

$$P_d = p(\text{transition} \mid \text{transition}) = [\,p(1|1)\, p(0|0) + p(1|0)\, p(0|1)\,] \tag{4}$$

and

$$P_f = p(\text{transition} \mid \text{no transition}) = [\,p(1|0)\, p(0|0) + p(1|1)\, p(0|1)\,] . \tag{5}$$


It is interesting to note that the expression in Equation (4) is independent of the prior
probabilities of the deviation pairs $(i,j)$, whereas Equation (5) requires that all variance pairs have equal
probability,

$$p(1,0) = p(0,1) = p(0,0) = p(1,1) = 1/4 . \tag{6}$$

This condition was maintained in Monte Carlo simulations which are described in following
sections. The conditional probabilities appearing in Equations (4) and (5) are given by

$$p(1 \mid i) = \int_{\gamma}^{\infty} p_i(x)\, dx \tag{7}$$

and

$$p(0 \mid i) = \int_{0}^{\gamma} p_i(x)\, dx , \tag{8}$$

where $p_i$ is the function $p(x)$ in Equation (2) with $\sigma$ given by $\sigma_0$ or $\sigma_1$ for $i$ of 0 or 1,
respectively.
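Equations (4) to (8) can be evaluated numerically for any threshold. The sketch below is one possible way to do so and is not the report's own procedure: it computes the integral in Equation (8) with the standard power series for the regularized lower incomplete gamma function, and the function names are introduced here for illustration only.

```python
import math


def lower_prob(gamma, n, sigma):
    """p(0|i) of Eq. (8): probability that x falls below gamma."""
    a = n / 2.0
    z = gamma / (2.0 * sigma ** 2)
    if z <= 0.0:
        return 0.0
    # Series for the regularized lower incomplete gamma function P(a, z):
    # P(a, z) = z^a e^{-z} * sum_k z^k / Gamma(a + k + 1).
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    if term == 0.0:
        # The first term underflowed; far in the upper tail the result is 1.
        return 1.0 if z > a else 0.0
    total = term
    k = 1
    while term > 1e-16 * total and k < 10000:
        term *= z / (a + k)
        total += term
        k += 1
    return min(total, 1.0)


def transition_probs(gamma, n, sigma0, sigma1):
    """Return (P_d, P_f) from Eqs. (4) and (5) for threshold gamma."""
    p0_0 = lower_prob(gamma, n, sigma0)   # p(0|0), Eq. (8)
    p0_1 = lower_prob(gamma, n, sigma1)   # p(0|1), Eq. (8)
    p1_0 = 1.0 - p0_0                     # p(1|0), Eq. (7)
    p1_1 = 1.0 - p0_1                     # p(1|1), Eq. (7)
    p_d = p1_1 * p0_0 + p1_0 * p0_1       # Eq. (4)
    p_f = p1_0 * p0_0 + p1_1 * p0_1       # Eq. (5)
    return p_d, p_f


for gamma in (10.0, 20.0, 30.0):
    print(gamma, transition_probs(gamma, 10, 1.0, 2.0))
```

Sweeping `gamma` over a grid and plotting both returned values against it gives curves of the kind described for the threshold figures.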

The relationship between $P_f$ and $P_d$ as the threshold $\gamma$ is varied, known as the receiver
operating characteristic (ROC), characterizes the hypothesis test. Equations (4) and (5) refer to a
decision on the existence of a variance transition in a Gaussian random process. The situation of no
transition is distinguished from a high/low or low/high variance transition. The similarity of the
test to the binary exclusive-or map suggests the identification as the stochastic exclusive-or test.

As revealed in Equations (4) to (8), detection and false alarm probabilities for classical
transition detection depend upon the threshold parameter $\gamma$. Figures 2(a-d) to 4(a-d) contain plots of
false alarm and detection probability as a function of threshold for noise deviations $\sigma_0$ of one
and $\sigma_1$ of two, four, and eight. As discussed in Sections III and IV, these are the deviations upon
which neural net performance was evaluated. The figures within each set correspond to different
widths $N$ of the sampling window. The plots were constructed by numerical computation of $P_f$
and $P_d$ for each $\gamma$ from Equations (4) to (8).

For each pair of deviations $(\sigma_0, \sigma_1)$ and each sampling window $N$, the maximum of detection
and minimum of false alarm probability occur at intermediate values of threshold parameter. At
large values of $\sqrt{N}(\sigma_1 - \sigma_0)$ the optimum threshold was not obtained numerically within the
figure domain [see Figures 4(a-d)]. Figure 5 contains the false alarm probability curves for fixed
window ($N = 10$) as $\sigma_1$ is varied. Note that the optimum threshold slowly increases with the value
$\sigma_1$ for fixed window widths. The optimum threshold is obtained from the condition

$$\frac{dP_d}{d\gamma} = 0 , \tag{9}$$

which from Equations (4) to (8) is expressed as

$$p_1(\gamma) \left[ 1 - 2 \int_0^{\gamma} p_0(x)\, dx \right] + p_0(\gamma) \left[ 1 - 2 \int_0^{\gamma} p_1(x)\, dx \right] = 0 . \tag{10}$$

Although Equation (10) can be readily solved by Newton's method, values of the optimum
threshold (and corresponding $P_f$ and $P_d$) were obtained directly from the plots in Figures 2(a-d)
to 4(a-d). This provided sufficient accuracy for the comparison with neural net performance, as
summarized in following sections.
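As a cross-check on thresholds read from plots, the stationarity condition can also be solved numerically. The sketch below uses bisection rather than Newton's method, applied to the derivative of Equation (4) with respect to the threshold, with Equations (7) and (8) substituted. The bracketing interval and all function names are assumptions made for this illustration.

```python
import math


def density(x, n, sigma):
    """Eq. (2), evaluated in logarithms for numerical safety (x > 0)."""
    h = n / 2.0
    return math.exp((h - 1.0) * math.log(x) - x / (2.0 * sigma ** 2)
                    - h * math.log(2.0) - n * math.log(sigma)
                    - math.lgamma(h))


def lower_prob(gamma, n, sigma):
    """Integral of Eq. (2) from 0 to gamma (incomplete-gamma series)."""
    a, z = n / 2.0, gamma / (2.0 * sigma ** 2)
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    total, k = term, 1
    while term > 1e-16 * total and k < 10000:
        term *= z / (a + k)
        total += term
        k += 1
    return min(total, 1.0)


def slope(gamma, n, s0, s1):
    """dP_d/dgamma obtained by differentiating Eq. (4)."""
    return (density(gamma, n, s1) * (1.0 - 2.0 * lower_prob(gamma, n, s0))
            + density(gamma, n, s0) * (1.0 - 2.0 * lower_prob(gamma, n, s1)))


def optimum_threshold(n, s0, s1, tol=1e-9):
    """Bisection on the sign of the slope; the bracket is an assumption."""
    lo, hi = 1e-3, n * s1 ** 2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope(mid, n, s0, s1) > 0.0:
            lo = mid   # P_d still rising: the optimum lies to the right
        else:
            hi = mid   # P_d falling: the optimum lies to the left
    return 0.5 * (lo + hi)


print(optimum_threshold(10, 1.0, 2.0))
```

Bisection is chosen here only because it needs no second derivative and cannot overshoot the bracket; Newton's method would converge faster near the root.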
Figure 4. False alarm and detection probability vs threshold parameter; noise deviations $\sigma_0 = 1.0$ and $\sigma_1$;
(a) 2-sample window, (b) 6-sample window, (c) 10-sample window, and (d) 20-sample window.
III. MONTE CARLO TEST OF THE PERCEPTRON


In this section the performance of a perceptron neural network on the stochastic exclusive-or
problem is analyzed. False alarm and detection probabilities as a function of training iteration
are computed by Monte Carlo simulation. Figure 6 contains the minimal perceptron net required
to perform the stochastic exclusive-or mapping.⁷ Sample variances $(x_1, x_2)$ computed from a pair
of nonoverlapping data windows form the net inputs. The initial operation of the net in Figure 6
is the computation of

$$A_0 + A_1 x_1 + A_2 x_2 + A_3 x_1 x_2 , \tag{11}$$

where $(A_0, A_1, A_2, A_3)$ is the connection vector. A transition (or no transition) is implied if the
quantity in Equation (11) is positive (or negative). Variances are computed on fixed length
windows taken before and after the point at which the transition may have occurred. The variances
form a pair of $\chi^2$-distributed random variables with mean value $\sqrt{N}\sigma$, where $N$ is the window
length and $\sigma$ the standard deviation of the Gaussian process. A schematic representation of the
hypothesis test as a mapping is shown in Figure 7.

Connection weights $(A_1, A_2, A_3)$ and the threshold $A_0$ are adjusted from initially random
values by application of a training ensemble of input variance pairs (for which the existence of a
transition is known). The training algorithm is exactly that which appears in the perceptron
convergence theorem of Reference 7. A normalized predicate vector is defined from the input pair
$(x_1, x_2)$ as,

$$\hat{\Phi} = \frac{(1, x_1, x_2, x_1 x_2)}{\sqrt{1 + x_1^2 + x_2^2 + (x_1 x_2)^2}} . \tag{12}$$


Figure  6.  A  minimal  neural  net  structure  for  transition  detection  based  upon  the  exclusive-or  logic  element. 
Starting with a random connection weight-threshold four-vector,

$$A = (A_0, A_1, A_2, A_3) , \tag{13}$$

alter $A$ if

(1) $\hat{\Phi}$ corresponds to a transition and $A \cdot \hat{\Phi} < 0$,

(2) $\hat{\Phi}$ corresponds to no transition and $A \cdot \hat{\Phi} > 0$.

In either case adjust $A$ additively with the normalized predicate vector,

$$A^{(n)} = A^{(n-1)} \pm \hat{\Phi} , \tag{14}$$

where addition and subtraction correspond to cases (1) and (2), respectively. The effect of
Equation (14) is the adjustment of the product $A \cdot \hat{\Phi}$ positively for a transition and negatively for no
transition.
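The training rule above can be sketched as a short loop. This is a minimal illustration under stated assumptions: the inputs are taken to be the sum-of-squares statistic of Equation (1), the deviations (1 and 8), window length (10), iteration count, seed, and initial weight range are arbitrary choices, and all function names are introduced here.

```python
import math
import random


def predicate(x1, x2):
    """Normalized predicate vector of Eq. (12)."""
    norm = math.sqrt(1.0 + x1 ** 2 + x2 ** 2 + (x1 * x2) ** 2)
    return [1.0 / norm, x1 / norm, x2 / norm, x1 * x2 / norm]


def dot(a, phi):
    return sum(ai * pi for ai, pi in zip(a, phi))


def train_step(a, x1, x2, is_transition):
    """Apply the update of Eq. (14) only if the pair is misclassified."""
    phi = predicate(x1, x2)
    score = dot(a, phi)
    if is_transition and score < 0.0:        # case (1): add the predicate
        return [ai + pi for ai, pi in zip(a, phi)]
    if not is_transition and score > 0.0:    # case (2): subtract it
        return [ai - pi for ai, pi in zip(a, phi)]
    return a                                 # otherwise leave A unchanged


def sample_statistic(rng, n, sigma):
    """Sum of squares of n zero-mean Gaussian samples, as in Eq. (1)."""
    return sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))


rng = random.Random(1)                       # arbitrary seed
a = [rng.uniform(-1.0, 1.0) for _ in range(4)]
conditions = [(8.0, 1.0), (8.0, 8.0), (1.0, 8.0), (1.0, 1.0)]
for p in range(400):
    s_before, s_after = conditions[p % 4]    # alternate the two hypotheses
    x1 = sample_statistic(rng, 10, s_before)
    x2 = sample_statistic(rng, 10, s_after)
    a = train_step(a, x1, x2, s_before != s_after)
print(a)
```

Because the predicate vector has unit length, each correction changes the product of $A$ with that vector by exactly one in the required direction.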


Figure  7.  Schematic  representation  of  stochastic  exclusive-or  map  for  variance  transition. 
The use of the network in Figure 6, with the corresponding four-vector $A$, is motivated by
the linear separation principle of perceptron learning.⁷ Consider a two-dimensional space
consisting of a before and after variance pair $(x_1, x_2)$ from an ensemble of stochastic data. As shown in
Figure 8, the task of transition detection is to differentiate a region of probability clouds centered
at $(\sqrt{N}\sigma_0, \sqrt{N}\sigma_0)$ and $(\sqrt{N}\sigma_1, \sqrt{N}\sigma_1)$ from probability clouds at $(\sqrt{N}\sigma_0, \sqrt{N}\sigma_1)$ and
$(\sqrt{N}\sigma_1, \sqrt{N}\sigma_0)$. As shown in Figure 9, the embedding,

$$(x_1, x_2) \rightarrow (x_1, x_2, x_1 x_2) \tag{15}$$

enhances the linear separability of these regions, thereby suggesting the structure in Figure 6.
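A small worked example shows why the product term helps. With $c$ the midpoint between the two cloud-center coordinates, the function $-(x_1 - c)(x_2 - c)$ is linear in the embedded coordinates $(x_1, x_2, x_1 x_2)$, is positive at the two transition centers, and is negative at the two no-transition centers. The weights below are a hand-picked illustration, not values from the report, and the window length and deviations are arbitrary.

```python
import math


def embedded_score(x1, x2, c):
    """Linear function of the embedded point (x1, x2, x1*x2) of Eq. (15).

    The weights (A0, A1, A2, A3) = (-c*c, c, c, -1) are chosen by hand for
    illustration; they expand to -(x1 - c) * (x2 - c).
    """
    a0, a1, a2, a3 = -c * c, c, c, -1.0
    return a0 + a1 * x1 + a2 * x2 + a3 * x1 * x2


n, sigma0, sigma1 = 10, 1.0, 2.0
low = math.sqrt(n) * sigma0    # cloud-center coordinate for low variance
high = math.sqrt(n) * sigma1   # cloud-center coordinate for high variance
c = 0.5 * (low + high)         # midpoint between the two coordinates

for center in [(low, low), (high, high), (low, high), (high, low)]:
    print(center, embedded_score(center[0], center[1], c))
```

No such sign pattern can be produced by a function linear in $(x_1, x_2)$ alone, which is the exclusive-or obstruction the embedding removes.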


Figure 8. Two-dimensional representation of variance transition detection. Distinguished regions
centered at $\{(\sqrt{N}\sigma_0, \sqrt{N}\sigma_0), (\sqrt{N}\sigma_1, \sqrt{N}\sigma_1)\}$ and $\{(\sqrt{N}\sigma_0, \sqrt{N}\sigma_1), (\sqrt{N}\sigma_1, \sqrt{N}\sigma_0)\}$.


Figure 9. Embedding of transition detection map to three dimensions. Enhanced linear separability
of regions in Figure 8. [Axes: $x_1$ and $x_1 x_2$, with marks at $\sqrt{N}\sigma_0$ and $\sqrt{N}\sigma_1$.]


In order to model perceptron performance on stochastic data, a training ensemble of NT
window pairs was created. Each window contained $N$ samples of Gaussian random noise with
standard deviation $\sigma_0$ or $\sigma_1$. The training algorithm was applied to a set of variance pairs
$[x_1^{(p)}, x_2^{(p)}]$ which as a function of $p$ sequence through possible noise conditions $(\sigma_1, \sigma_0)$, $(\sigma_1, \sigma_1)$,
$(\sigma_0, \sigma_1)$, and $(\sigma_0, \sigma_0)$. The range of the net mapping corresponding to this sequence alternates
between transition and no transition hypotheses. Upon completion of a number of training
iterations a performance ensemble was created which contained 500 transition and no transition trace
pairs. Ensemble-derived detection and false alarm probabilities were obtained simply by counting
the number of detected transitions and the number of no-transition pairs falsely flagged as
transitions in the ensemble. Detection and false alarm probabilities are functions of the number of
iterations of the training algorithm.
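The counting procedure can be sketched as follows. The detector is passed in as a function so that the same routine serves a trained net or a classical test. The names, the seed, the threshold value of 18, and the equal split between the two transition conditions and between the two no-transition conditions are assumptions of this illustration.

```python
import random


def sample_statistic(rng, n, sigma):
    """Sum of squares of n zero-mean Gaussian samples, as in Eq. (1)."""
    return sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))


def estimate_rates(detector, n, sigma0, sigma1, pairs_per_class, rng):
    """Return (P_d, P_f) as fractions of a freshly drawn ensemble."""
    transitions = [(sigma0, sigma1), (sigma1, sigma0)]
    steady = [(sigma0, sigma0), (sigma1, sigma1)]
    detected = 0
    false_alarms = 0
    for k in range(pairs_per_class):
        # One true-transition pair: a detection counts toward P_d.
        before, after = transitions[k % 2]
        x1 = sample_statistic(rng, n, before)
        x2 = sample_statistic(rng, n, after)
        if detector(x1, x2):
            detected += 1
        # One no-transition pair: a detection counts toward P_f.
        before, after = steady[k % 2]
        x1 = sample_statistic(rng, n, before)
        x2 = sample_statistic(rng, n, after)
        if detector(x1, x2):
            false_alarms += 1
    return detected / pairs_per_class, false_alarms / pairs_per_class


def threshold_detector(gamma):
    """Classical test: transition if exactly one statistic exceeds gamma."""
    return lambda x1, x2: (x1 > gamma) != (x2 > gamma)


rng = random.Random(2)  # arbitrary seed
print(estimate_rates(threshold_detector(18.0), 10, 1.0, 2.0, 500, rng))
```

Calling the routine after each block of training iterations, with the current net as `detector`, yields probability-versus-iteration curves of the kind plotted in the figures.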

Figures 10(a-d), 11(a-d), and 12(a-d) contain false alarm and detection probability vs NT for
standard deviation $\sigma_0$ of one and $\sigma_1$ of two, four, and eight, respectively. Within each figure set
the length of the windows is varied such that (a) to (d) correspond to increasing sample number
$N$. Note that in all cases detection and false alarm probabilities were initially near one due to the
randomly chosen initial connection weight four-vector $A^{(0)}$ [Equation (13)]. Learning is revealed
by the sudden drop in false alarm probability, a characteristic which reflects the nonlinear nature
of the training algorithm. The nonlinearity appears in the product $x_1 x_2$ of the predicate
[Equation (12)].


As seen in Figures 10(b), 11(b), and 12(a), unambiguous transition detection for $\sigma_1$ of two,
four, and eight required window lengths of ten, five, and two, respectively. Figure 13 contains the
false alarm probability vs training iteration for 10-sample windows as $\sigma_1$ is varied. Note that
generally learning is observed in <100 iterations with those combinations of $\sigma_1$ and $N$ for which
net connections converge. Asymptotic false alarm and detection probability (in the limit of many
iterations) correspond to the optimum performance of the perceptron. These values are compared
with classical test performance at optimum thresholds in Section V.
Figure 13. False alarm probability vs iteration number (NT) for 10-sample windows.
Values $\sigma_1$ of 2, 4, and 8 considered separately.
IV. MONTE CARLO TEST OF BACK PROPAGATION


Back propagation networks are a recently discovered modification of the perceptron which
involve only first-order neural units.² Figure 14 contains a back propagation network appropriate
for the stochastic exclusive-or mapping. It consists of two inputs, $H$ middle layer neurons, and a
single output neuron. Network parameters consist of thresholds $\{\theta_i \mid i = 1, 2, \ldots, 2H + 4\}$ at each
neuron and connection weights $\{W_{ij} \mid i,j = 1, 2, \ldots, 2H + 4\}$ between neuron pairs. The operation at the
$(i,j)$ connection is to multiply the output of the leftmost neuron ($j$) of the pair by $W_{ij}$ before
input to the rightmost neuron ($i$). The total input to a neuron $i$ is then

$$I_i = \sum_{j \in Z} W_{ij} O_j , \tag{16}$$


where $Z$ is the set of neurons connected to $i$ from the left, and $O_j$ is the output of neuron $j$. The
output of neuron $i$ is given by

$$O_i = \frac{1}{1 + \exp[-(I_i + \theta_i)]} , \tag{17}$$

corresponding to nonlinear threshold activation at $\theta_i$. The nonlinear activation function in
Equation (17) is roughly analogous to the threshold which is applied upon output of the perceptron
(see Figure 6).


Figure 14. Back propagation network for stochastic exclusive-or: two input neurons, H hidden layer neurons,
single output. Connection weights $\{W_{ij} \mid i,j = 1, 2, \ldots, H + 4\}$; thresholds $\{\theta_i \mid i = 1, 2, \ldots, H + 4\}$.
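The forward operation of such a network can be sketched as below. The logistic activation with an additive threshold is an assumption here; it is chosen because it is consistent with the $O(1 - O)$ factor and the threshold update that appear in the learning rule. The data layout and function names are likewise illustrative.

```python
import math


def logistic(u):
    """Assumed activation: smooth threshold function with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def forward(x, w_hidden, th_hidden, w_out, th_out):
    """Forward pass for two inputs, H hidden neurons, and one output."""
    hidden = []
    for weights, theta in zip(w_hidden, th_hidden):
        total_input = sum(w * xi for w, xi in zip(weights, x))   # Eq. (16)
        hidden.append(logistic(total_input + theta))
    total_input = sum(w * o for w, o in zip(w_out, hidden))      # Eq. (16)
    return hidden, logistic(total_input + th_out)


# Hypothetical parameters for a network with H = 2 hidden neurons.
w_hidden = [[0.5, -0.5], [-0.5, 0.5]]
th_hidden = [0.0, 0.0]
w_out = [1.0, 1.0]
th_out = -1.0
print(forward((0.2, 0.8), w_hidden, th_hidden, w_out, th_out))
```

With all weights and thresholds at zero every neuron outputs exactly one half, which is a convenient sanity check on an implementation.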

The adaptation of connection weights and thresholds, resulting in a match between output and
target, is implemented by the back propagation learning algorithm. The derivation of the
algorithm, which is based upon a least-squares minimization, and a discussion of numerous
implementation issues, can be obtained in Reference 2. For the purpose of this report, an outline of
the algorithm will be presented in order to define required auxiliary parameters. A training
ensemble is defined, $\{[x_1^{(p)}, x_2^{(p)}, t^{(p)}] \mid p = 1, 2, \ldots, \mathrm{NT}\}$, consisting of input variance pairs $[x_1^{(p)}, x_2^{(p)}]$
and an output transition flag $t^{(p)}$. The parameter $t^{(p)}$ is one (zero) if $[x_1^{(p)}, x_2^{(p)}]$ corresponds to
a transition (no transition) condition. The learning algorithm is defined recursively starting with a
randomly chosen set of initial connection weights $\{W_{ij}^{(0)}\}$ and thresholds $\{\theta_j^{(0)}\}$. Assuming that
training on $\{[x_1^{(1)}, x_2^{(1)}], \ldots, [x_1^{(p-1)}, x_2^{(p-1)}]\}$ has resulted in net parameters $\{W_{ij}^{(p-1)}\}$ and $\{\theta_j^{(p-1)}\}$,
the input of pair $[x_1^{(p)}, x_2^{(p)}]$ results in output $O^{(p)}$. Define the output error

$$\delta^{(p)} = [\,t^{(p)} - O^{(p)}\,] , \tag{18}$$

and backward propagate error to each neuron $j$ by

$$\delta_j^{(p)} = O_j^{(p-1)} [\,1 - O_j^{(p-1)}\,] \sum_k \delta_k^{(p)} W_{kj}^{(p-1)} , \tag{19}$$

where $O_j^{(p-1)}$ was the output of neuron $j$ on the $(p-1)$st iteration. The connection weight $W_{ij}$
between neurons $i$ and $j$ is altered by

$$\Delta W_{ij}^{(p)} = \eta\, \delta_i^{(p)} O_j^{(p-1)} + \alpha\, \Delta W_{ij}^{(p-1)} , \tag{20}$$

where $\eta$ controls the rate of connection weight modification and $\alpha$ is a smoothing parameter.²
The threshold $\theta_j^{(p-1)}$ for the $j$th neuron is adjusted on the $p$th iteration by

$$\Delta \theta_j^{(p)} = \eta\, \delta_j^{(p)} . \tag{21}$$

Note that the back propagation learning algorithm, Equations (19) to (21), requires the
introduction of new parameters $\eta$ and $\alpha$. During training of the network in Figure 14 on stochastic data,
the parameters were chosen to optimize performance, as measured by training time and
smoothness of convergence.
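One iteration of the update rules can be sketched for a network with two inputs, $H$ hidden neurons, and one output. Several simplifications are assumed for illustration: a logistic activation, neuron outputs taken from the forward pass on the current pair rather than stored from the previous iteration, and an all-zero starting point (the report starts from random values) so that the result is easy to verify by hand. All names are introduced here.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def make_net(h):
    """All-zero weights, thresholds, and previous weight changes."""
    return {"w1": [[0.0, 0.0] for _ in range(h)], "th1": [0.0] * h,
            "w2": [0.0] * h, "th2": 0.0,
            "dw1": [[0.0, 0.0] for _ in range(h)], "dw2": [0.0] * h}


def output(net, x):
    """Forward pass: hidden outputs and the single net output."""
    hidden = [logistic(w[0] * x[0] + w[1] * x[1] + th)
              for w, th in zip(net["w1"], net["th1"])]
    out = logistic(sum(w * o for w, o in zip(net["w2"], hidden)) + net["th2"])
    return hidden, out


def train_step(net, x, target, eta, alpha):
    """One update on the pair x with transition flag target."""
    hidden, out = output(net, x)
    delta_out = target - out                                    # Eq. (18)
    delta_hid = [o * (1.0 - o) * delta_out * w                  # Eq. (19)
                 for o, w in zip(hidden, net["w2"])]
    for j, o in enumerate(hidden):
        net["dw2"][j] = eta * delta_out * o + alpha * net["dw2"][j]  # Eq. (20)
        net["w2"][j] += net["dw2"][j]
    net["th2"] += eta * delta_out                               # Eq. (21)
    for j, d in enumerate(delta_hid):
        for k in range(2):
            net["dw1"][j][k] = eta * d * x[k] + alpha * net["dw1"][j][k]
            net["w1"][j][k] += net["dw1"][j][k]
        net["th1"][j] += eta * d
    return out  # output before the update


net = make_net(2)
before = train_step(net, (0.5, 0.2), 1.0, eta=0.3, alpha=0.2)
print(before, output(net, (0.5, 0.2))[1])
```

The smoothing term reuses the previous weight change, so the stored changes must persist between calls; here they live in the same dictionary as the weights.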

Progress of the training procedure is measured by the match of output $O^{(p)}$ to target $t^{(p)}$ for
each element of the training ensemble. This may be obtained through a Hamming-type of
measure,

$$H = \sum_{p=1}^{\mathrm{NT}} \Theta[\,|t^{(p)} - O^{(p)}|\,] , \tag{22}$$

where

$$\Theta[\,|t^{(p)} - O^{(p)}|\,] = \begin{cases} 1 & |t^{(p)} - O^{(p)}| > r \\ 0 & |t^{(p)} - O^{(p)}| < r \end{cases} \tag{23}$$

with a predetermined match parameter $r$. Another training measure is simply the rms difference
between the ensemble outputs and targets,

$$H = \sqrt{\sum_{p=1}^{\mathrm{NT}} [\,O^{(p)} - t^{(p)}\,]^2} . \tag{24}$$
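Both training measures are simple to compute from the lists of net outputs and targets. In the sketch below the rms measure is taken as the root of the summed squared differences with no division by the ensemble size, which is an assumption about the normalization; the function names and sample values are illustrative.

```python
import math


def hamming_measure(outputs, targets, r):
    """Count ensemble elements whose output misses the target by more than r."""
    return sum(1 for o, t in zip(outputs, targets) if abs(t - o) > r)


def rms_measure(outputs, targets):
    """Root of the summed squared output-target differences."""
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets)))


outputs = [0.95, 0.30, 0.05]   # hypothetical analog net outputs
targets = [1.0, 1.0, 0.0]      # transition flags
print(hamming_measure(outputs, targets, 0.1), rms_measure(outputs, targets))
```

The Hamming-type count changes only when an output crosses the match band, whereas the rms measure responds to every small change, so the two can show convergence at different rates.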


Figure 15 contains the Hamming and rms measures vs training iteration for a 40-element
stochastic exclusive-or training ensemble. The network was that in Figure 14 with 64 hidden units. Noise
deviations $\sigma_0$ and $\sigma_1$ were one and two, respectively; and the input windows contained 20
samples. The match parameter $r$ in Equation (23) was 0.1. The training algorithm was run repeatedly
on the entire ensemble of 40 elements with alternating transition and no transition outputs. Note
that after ~100 presentations of the ensemble smooth convergence of both Hamming and rms
measures was interrupted by oscillations. The source of the training reversals, which appear
commonly for some $\alpha$ parameters, has yet to be determined.


Figure 15. Hamming and rms measures vs iteration number (NT) for stochastic exclusive-or; $\sigma_0 = 1.0$, $\sigma_1 = 2.0$,
$\alpha = 0.2$, $\eta = 0.3$, 64 hidden layer neurons, 20-sample window.
The performance of a trained back propagation network was measured by Monte Carlo
simulation with the aid of a performance ensemble. This consisted of a set of NE variance pairs
$[x_1^{(q)}, x_2^{(q)}]$ and the corresponding designator $t^{(q)}$ ($q = 1, 2, \ldots, \mathrm{NE}$). As in the case of perceptron
modeling, false alarm and detection probabilities were computed as a function of iteration
number (as implemented on the training ensemble) from net operation on the performance
ensemble. A series of experimental runs determined that the optimum number $H$ of hidden units
is sixteen for the stochastic exclusive-or problem. In contrast to the binary output of a
perceptron network, the output of the back propagation network is analog. This requires the addition
of an output match parameter $r$, which is the maximum allowed difference between the output
and target for hypothesis determination.

Figures 16(a-c), 17(a-c), and 18(a-c) contain simulated $P_f$ and $P_d$ vs training iteration for
windows of 5, 10, and 20 samples, respectively. In all cases the pair of noise deviations $\sigma_0$ and $\sigma_1$
were 1.0 and 2.0, respectively. In each of the simulations the smoothing parameter $\alpha$ was
empirically chosen to optimize network performance. Within each figure set the $P_f$, $P_d$ plots correspond
to different sized training ensembles. For example, Figures 16(a), (b), and (c) contain false alarm
and detection probability for input window lengths of 5 samples with training ensembles of 6, 10,
and 40 elements. Note that for 5-sample windows the size of the training ensemble has little
effect on network performance or training time. In the case of 10-sample windows, the use of a
40-element training ensemble halved the asymptotic false alarm probability over that of a
10-element ensemble. This comparison is obtained in Figures 17(a) and (c). Unlike the perceptron,
back propagation networks converged on 5-sample windowed variances. However, the asymptotic
$(P_f, P_d)$ of about (0.6, 0.7) in Figure 16(c) is to be compared with the classical optimum of (0.2,
0.8) in Figure 2(b). The 10-sample variance inputs yield $(P_f, P_d)$ of about (0.15, 0.9) with a
40-element training ensemble [Figure 17(c)], which is close to the classical optimum values in
Figure 2(c).

A common feature of back propagation networks, as revealed in Figures 16 and 17, is the
extraordinary number of training iterations (of order $10^4$) required to obtain network
convergence. A partial remedy is to scale the input training and performance ensemble variances to the
range (0,1). The magnitudes of the required connection weight adjustments are then smaller,
intuitively resulting in faster convergence. Figures 18(a) to (c) contain a set of simulations on
20-sample variances, in which convergence time was reduced by a factor of 10 through input scaling.
Note, however, that the asymptotic false alarm probability of 0.2 is four times the classical
optimum of 0.05 in Figure 2(d). Because the mean value of an input variance is $\sqrt{N}\sigma$, it is expected
that the scaling of inputs has the greatest effect for large $N$, a limit which corresponds to a large
contraction of the map input domain.
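Input scaling of this kind can be sketched as a single division by a common factor. The choice of factor below, the largest variance in the training ensemble, is an assumption made for illustration; the report does not state how the scale was chosen. Using the training factor for the performance ensemble keeps both sets on the same footing, although an occasional performance value may then slightly exceed one.

```python
def scale_pairs(training_pairs, performance_pairs):
    """Divide every variance by the largest training variance."""
    scale = max(max(x1, x2) for x1, x2 in training_pairs)

    def rescale(pairs):
        return [(x1 / scale, x2 / scale) for x1, x2 in pairs]

    return rescale(training_pairs), rescale(performance_pairs), scale


# Hypothetical variance pairs.
train = [(2.0, 8.0), (4.0, 1.0)]
perf = [(16.0, 4.0)]
print(scale_pairs(train, perf))
```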
Figure 16. False alarm and detection probability vs iteration number; $\sigma_0 = 1.0$, $\sigma_1 = 2.0$, $H = 16$, $\alpha = 0.4$,
$\eta = 0.8$, 5-sample window. (a) 4-layer network, 6-element training ensemble; (b) 4-layer network, 10-element
training ensemble; and (c) 4-layer network, 40-element training ensemble.
Figure 17. False alarm and detection probability vs iteration number; $\sigma_0 = 1.0$, $\sigma_1 = 2.0$, $H = 16$, $\alpha = 0.3$,
$\eta = 0.3$, 10-sample window. (a) 4-layer network, 6-element training ensemble; (b) 4-layer network, 10-element
training ensemble; and (c) 4-layer network, 40-element training ensemble.


V.  COMPARISON  OF  NEURAL  NETS 
AND  CLASSICAL  PERFORMANCE 


A specific goal of this report is a quantitative comparison of neural net and classical
transition detection. A desirable feature of transition detection is the existence of a classical hypothesis
test which forms the basis of the comparison.

For the case of the perceptron, the asymptotic values of $P_f$ and $P_d$ in Figures 10(a-d) to
12(a-d) are applied for the comparison with Neyman-Pearson testing. These probabilities are an
average of $P_f$ and $P_d$ over a large number of iterations past the point at which the net has
learned the map. Note that for small values of $\sigma_1$ and/or $N$, the mean probabilities have large
variances. This is observed in the large iteration number limit of Figures 10(a) and (b),
corresponding to $\sigma_1$ of two and windows of length two and five, respectively. The effect is due to
input noise as reflected in uncertainty of the final connection vector $A$ for the net. Note that
variances in ensemble-derived probabilities vanish rapidly for higher values of $\sigma_1$ and/or larger
window length.

The asymptotic false alarm and detection probabilities of each network were computed from
ensemble-derived values in the limit of many training iterations. Figures 19 to 22 contain plots of
the asymptotic perceptron and optimum classical values of Pf and Pd as a function of window
length N. The Neyman-Pearson optimum thresholds were obtained from Figures 2(a-d) to 4(a-d).
Standard deviations $\sigma_1$ of two and four were considered separately. The shaded region on each
plot represents the range of false alarm and detection probabilities obtained from computed
variances in these values. The thick curve represents the classical optimum as a function of the
window length. Note that in all cases the averaged perceptron performance was close to the classical
optimum, and the deviation of the probabilities vanished at window lengths of twenty.
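As a rough illustration of how classical false alarm and detection probabilities of this kind can be estimated by Monte Carlo simulation, the following Python sketch may be helpful. It is not the procedure used to generate the figures in this report. It assumes zero-mean Gaussian samples, a maximum likelihood threshold applied separately to the sample variance of each window (rather than the Neyman-Pearson thresholds read from Figures 2 to 4), and equally probable deviation pairs. The function names, the trial count, and the seed are hypothetical.

```python
import math
import random

def ml_threshold(sigma0, sigma1):
    """Sample-variance value at which the two zero-mean Gaussian likelihoods are equal."""
    return 2.0 * math.log(sigma1 / sigma0) / (1.0 / sigma0**2 - 1.0 / sigma1**2)

def window_decision(rng, sigma, n, threshold):
    """Draw n zero-mean samples with deviation sigma; decide 1 if the sample variance exceeds threshold."""
    s2 = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n)) / n
    return 1 if s2 > threshold else 0

def estimate_pf_pd(n, sigma0=1.0, sigma1=2.0, trials=4000, seed=0):
    """Estimate (Pf, Pd) for window length n; a transition is declared when the two decisions differ."""
    rng = random.Random(seed)
    t = ml_threshold(sigma0, sigma1)
    sig = (sigma0, sigma1)
    hits = 0
    alarms = 0
    for _ in range(trials):
        # True transition: deviation pair (0,1) or (1,0), assumed equally likely.
        i = rng.randint(0, 1)
        hits += window_decision(rng, sig[i], n, t) ^ window_decision(rng, sig[1 - i], n, t)
        # No transition: deviation pair (0,0) or (1,1), assumed equally likely.
        k = rng.randint(0, 1)
        alarms += window_decision(rng, sig[k], n, t) ^ window_decision(rng, sig[k], n, t)
    return alarms / trials, hits / trials

if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        pf, pd = estimate_pf_pd(n)
        print(f"N={n:2d}  Pf~{pf:.3f}  Pd~{pd:.3f}")
```

Under these assumptions the estimated detection probability should rise and the false alarm probability should fall as the window length grows, which is the qualitative trend of the classical curves discussed above.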

The performance of back propagation networks was considered with $\sigma_1$ of two for various
window lengths. The simulations in Figures 16(a-c) to 18(a-c) contain net performance curves
with 6-, 10-, and 40-element training ensembles for each ($\sigma_1$, N) pair. In Figures 17(a-c) and 19
performance is weakly dependent on ensemble size, suggesting that for sufficiently large windows
(N ≥ 10) distribution sampling is not a difficult problem. This is expected because the
$\chi^2$-distributions which form the input variances are not sufficiently overlapped to require dense
sampling.

In Figures 16(a-c) it is observed that network weights and thresholds converged for all
5-sample window ensembles. The simulation in Figure 10(b) reveals that the perceptron failed to
converge for 5-sample input variances. The larger number of adjustable parameters in the back
propagation network may result in greater discrimination of overlapped distributions. Note,
however, that while the asymptotic Pd of 0.74 is near the classical value of 0.78 [Figure 2(b)], a Pf of
0.54 is more than twice the classical optimum of 0.22 [Figure 2(b)].


The  performance  of  the  networks  on  larger  window  input  variances  is  significantly  closer  to 
the  classical  optimum.  Figures  17(c)  and  18(c)  have  asymptotic  (Pf,  Pd)  pairs  of  (0.16,  0.93)  and 
(0.2,  0.93)  for  10-  and  20-sample  inputs,  respectively.  These  are  to  be  compared  with  probabilities 
of  (0.12,  0.88)  and  (0.04,  0.96)  for  the  optimum  Neyman-Pearson  test  with  10-  and  20-sample 
inputs  [Figures  10(c)  and  (d)]. 
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Figure 20. Perceptron and classical detection probability vs window length (N); $\sigma_0 = 1.0$, $\sigma_1 = 2.0$; shaded
region represents range of perceptron performance from ensemble-derived probabilities; thickened curve
represents classical performance.
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Figure 21. Perceptron and classical false alarm probability vs window length (N); $\sigma_0 = 1.0$, $\sigma_1 = 4.0$; shaded
region represents range of perceptron performance from ensemble-derived probabilities; thickened curve
represents classical performance.
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Figure 22. Perceptron and classical detection probability vs window length (N); $\sigma_0 = 1.0$, $\sigma_1 = 4.0$; shaded
region represents range of perceptron performance from ensemble-derived probabilities; thickened curve
represents classical performance.


VI.  CONCLUSION 


This report analyzed the performance of perceptron and back propagation neural nets on the
stochastic exclusive-or problem. The test consists of the detection of a variance transition in
Gaussian data from input of sample variances before and after the postulated transition. It is a
prototype test, applications of which include maneuver detection from sensor data. For maneuver
detection any data-derived sufficient statistic sensitive to target motion may be input to the
network. For example, satellite maneuver detection has been performed by neural nets in which the
input parameters are correlation lengths in the radar cross section. The only requirements for a
statistic are parameter discrimination in the hypothesis domain and repeatability in the data.

As revealed in the previous sections, Neyman-Pearson and neural net algorithms for hypothesis
testing are radically different. Neural net testing requires a representative training ensemble.
Although classical techniques are defined a priori, they are model dependent and optimum only
at a particular threshold. Often the correct threshold is identifiable only for a stationary process,
or if the nonstationarity is relatively simple (such as a variance transition). For this reason neural
net decisioning may be more robust to noise properties. It should be noted, however, that neural
net training requires the training ensemble to be representative of the continuous probability
densities which describe the stochastic process. The problems of distribution sampling are likely to
plague any ensemble-dependent learning algorithm.

In summary, it appears that for sufficiently discriminated input distributions a trained
network performs near the Neyman-Pearson optimum. For example, Figures 19 to 22 show
perceptron convergence to the optimum in <20 iterations. Important considerations for back
propagation networks are the requirements of a large number of training iterations ($\sim 10^4$ for stochastic
exclusive-or), and the tuning of training algorithm parameters. The required number of training
iterations results from the large number of independent, adjustable parameters (connection
weights, thresholds) in the network. This feature may aid net discrimination, as observed in
network convergence for five-sample input variances. The back propagation learning algorithm
requires three added parameters: $\eta$ for convergence rate, $\alpha$ for smoothing, and r to control
matching to binary output. A certain amount of experimentation was required to tune the
parameters for optimum net performance.
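To make the roles of the convergence rate and smoothing parameters more concrete, the following Python sketch shows one common form of a smoothed (momentum) gradient update. This is an assumption: the update rule itself is not restated in this section, and the sketch uses the widely used form in which each weight change is the negative gradient scaled by eta plus alpha times the previous change. The third parameter r is not modeled. The quadratic test function and all names are hypothetical.

```python
def train_step(w, delta_prev, grad, eta, alpha):
    """One smoothed update: new change = -eta * gradient + alpha * previous change."""
    delta = -eta * grad + alpha * delta_prev
    return w + delta, delta

def minimize_quadratic(eta=0.3, alpha=0.3, steps=100, target=3.0):
    """Apply the update to the toy error E(w) = (w - target)^2, starting from w = 0."""
    w = 0.0
    delta = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - target)  # dE/dw for the toy error
        w, delta = train_step(w, delta, grad, eta, alpha)
    return w

if __name__ == "__main__":
    print(minimize_quadratic())
```

On this toy problem the default values of 0.3 for both parameters converge quickly; on a real network the suitable range depends on the error surface, which is consistent with the need for experimentation noted above.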
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APPENDIX 

CLASSICAL  TRANSITION  DETECTION 

In this appendix Equations (4) and (5) relating detection and false alarm probabilities to the
quantities {p(i|j) | i,j ∈ {0,1}} are derived. Recall that the index zero or one corresponds to noise
deviation $\sigma_0$ or $\sigma_1$, respectively. The pair (i,j) denotes a transition from deviation $\sigma_i$ to deviation
$\sigma_j$, and the expression p(x|y) denotes the probability of x detection conditioned on y. The
relevant probabilities are then given by

$$P_d = p(\text{transition} \mid \text{transition}) \tag{A-1}$$

and

$$P_f = p(\text{transition} \mid \text{no transition}) , \tag{A-2}$$

for detection and false alarm. The expression in Equation (A-1) is given by

$$p(\text{transition} \mid \text{transition}) = p[(1,0) \mid \text{transition}] + p[(0,1) \mid \text{transition}] \tag{A-3}$$

The relationship in Bayes' theorem,

$$p(x \mid y) = \frac{p(x,y)}{p(y)} , \tag{A-4}$$

where p(x,y) represents the joint probability of x and y, can be applied to Equation (A-3) with
the result

$$p(\text{transition} \mid \text{transition}) = \frac{p[(0,1), \text{transition}] + p[(1,0), \text{transition}]}{p(\text{transition})} \tag{A-5}$$

where p(transition) represents the prior probability of a transition. A transition is obtained by
either a (1,0) or a (0,1) noise deviation pair. Equation (A-5) can be written in terms of the
probabilities for specific deviation pair detection with the result

$$p(\text{transition} \mid \text{transition}) = \frac{p[(0,1),(1,0)] + p[(0,1),(0,1)] + p[(1,0),(1,0)] + p[(1,0),(0,1)]}{p[(1,0)] + p[(0,1)]} \tag{A-6}$$

where p[(i,j)] represents the prior probability of a deviation pair (i,j). Application of Bayes'
theorem, Equation (A-4), results in the expression

$$p(\text{transition} \mid \text{transition}) = \frac{\left( p[(0,1) \mid (1,0)] + p[(1,0) \mid (1,0)] \right) p[(1,0)]}{p[(1,0)] + p[(0,1)]} + \frac{\left( p[(0,1) \mid (0,1)] + p[(1,0) \mid (0,1)] \right) p[(0,1)]}{p[(1,0)] + p[(0,1)]} \tag{A-7}$$


Recall that p[(i,j)|(k,l)] represents the detection of deviation pair (i,j) conditioned on the pair
(k,l). Assuming that the decision for this occurrence is based upon a pair of maximum likelihood
hypothesis tests before and after the transition, the conditional probabilities factorize; that is,

$$p[(i,j) \mid (k,l)] = p(i \mid k)\, p(j \mid l) \tag{A-8}$$

Substitution of Equation (A-8) into the conditional probabilities in Equation (A-7) results in the
expression

$$p(\text{transition} \mid \text{transition}) = \left[ p(1 \mid 1)\, p(0 \mid 0) + p(0 \mid 1)\, p(1 \mid 0) \right] , \tag{A-9}$$

where p(i|j) is given in Equations (7) and (8). It is interesting that prior probabilities p[(i,j)] have
canceled from Equation (A-7), indicating a detection probability independent of the prior
distribution of deviation pairs.
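The cancellation of the priors can be checked numerically. The Python sketch below evaluates Equation (A-7) with the factorization of Equation (A-8) for arbitrary priors and compares it with the closed form of Equation (A-9). The table values for p(i|j) are illustrative assumptions, not values from Equations (7) and (8), and the function names are hypothetical.

```python
def pd_from_priors(p, prior_10, prior_01):
    """Equation (A-7) using the factorization (A-8); p[i][j] holds p(i|j)."""
    def cond(i, j, k, l):
        # p[(i,j)|(k,l)] = p(i|k) p(j|l)
        return p[i][k] * p[j][l]
    term_10 = (cond(0, 1, 1, 0) + cond(1, 0, 1, 0)) * prior_10
    term_01 = (cond(0, 1, 0, 1) + cond(1, 0, 0, 1)) * prior_01
    return (term_10 + term_01) / (prior_10 + prior_01)

def pd_closed_form(p):
    """Equation (A-9)."""
    return p[1][1] * p[0][0] + p[0][1] * p[1][0]

# Illustrative (assumed) single-window probabilities: p(1|0) = 0.05, p(0|1) = 0.10.
P_TABLE = [[0.95, 0.10],
           [0.05, 0.90]]

if __name__ == "__main__":
    for priors in ((0.5, 0.5), (0.2, 0.8), (0.9, 0.1)):
        print(priors, pd_from_priors(P_TABLE, *priors), pd_closed_form(P_TABLE))
```

For any choice of the two priors the first function should return the same value as the second, up to rounding.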

The same argument applied to the false alarm probability in Equation (A-2) results in the
expression

$$p(\text{transition} \mid \text{no transition}) = \frac{2 \left\{ p(1 \mid 1)\, p(0 \mid 1)\, p[(1,1)] + p(1 \mid 0)\, p(0 \mid 0)\, p[(0,0)] \right\}}{p[(0,0)] + p[(1,1)]} \tag{A-10}$$


In this case the probability depends upon the prior probabilities p[(0,0)] and p[(1,1)] for the
ensemble upon which the hypothesis test is applied. An ensemble in which deviation pairs (0,0)
and (1,1) are equally probable results in

$$p(\text{transition} \mid \text{no transition}) = \left[ p(1 \mid 1)\, p(0 \mid 1) + p(0 \mid 0)\, p(1 \mid 0) \right] \tag{A-11}$$

Another interesting limit,

$$p[(1,1)] \ll p[(0,0)] , \tag{A-12}$$

corresponds to a high-noise environment being much less likely than a low-noise environment. In this
case Equation (A-10) reduces to

$$p(\text{transition} \mid \text{no transition}) = 2\, p(1 \mid 0)\, p(0 \mid 0) , \tag{A-13}$$

independent of the probability p[(0,0)].
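The two limits of the false alarm expression can be illustrated with a short Python sketch. It evaluates Equation (A-10) for given priors and compares the result with Equations (A-11) and (A-13). The values used for p(i|j) are illustrative assumptions rather than values from Equations (7) and (8), and the function names are hypothetical.

```python
def pf_general(p, prior_00, prior_11):
    """Equation (A-10); p[i][j] holds p(i|j)."""
    num = 2.0 * (p[1][1] * p[0][1] * prior_11 + p[1][0] * p[0][0] * prior_00)
    return num / (prior_00 + prior_11)

def pf_equal_priors(p):
    """Equation (A-11): pairs (0,0) and (1,1) equally probable."""
    return p[1][1] * p[0][1] + p[0][0] * p[1][0]

def pf_low_noise_limit(p):
    """Equation (A-13): p[(1,1)] much smaller than p[(0,0)]."""
    return 2.0 * p[1][0] * p[0][0]

# Illustrative (assumed) single-window probabilities: p(1|0) = 0.05, p(0|1) = 0.10.
P_TABLE = [[0.95, 0.10],
           [0.05, 0.90]]

if __name__ == "__main__":
    print(pf_general(P_TABLE, 0.5, 0.5), pf_equal_priors(P_TABLE))
    print(pf_general(P_TABLE, 1.0, 1e-9), pf_low_noise_limit(P_TABLE))
```

Unlike the detection probability, the general expression changes with the ratio of the two priors, so the ensemble composition has to be stated when false alarm rates are compared.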
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